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Probe-level microarray data are usually stored in matrices, where 
the row and column correspond to array and probe, respectively. Sci- 
entists routinely summarize each array by a single index as the ex- 
pression level of each probe set (gene). We examine the adequacy 
of a unidimensional summary for characterizing the data matrix of 
each probe set. To do so, we propose a low-rank matrix model for 
the probe-level intensities, and develop a useful framework for test- 
ing the adequacy of unidimensionality against targeted alternatives. 
This is an interesting statistical problem where inference has to be 
made based on one data matrix whose entries are not i.i.d. We ana- 
lyze the asymptotic properties of the proposed test statistics, and use 
Monte Carlo simulations to assess their small sample performance. 
Applications of the proposed tests to GeneChip data show that evi- 
dence against a unidimensional model is often indicative of practically 
relevant features of a probe set. 

1. Introduction. Oligonucleotide expression array technology is popular 
in many fields of biomedical research. The technology makes it possible to 
measure the abundance of messenger ribonucleic acid (mRNA) transcripts 
for a large number of genes simultaneously. One of them is the Genechip 
microarray technology, which is commercially developed by Affymetrix to 
measure gene expression by hybridizing the sample mRNA on a probe set, 
typically composed of 11-20 pairs of probes, in a specially designed chip 
that is called a "microarray" [Parmigiani et al. (2003)]. 

Two types of probes are used in the Genechip microarray technology, 
the perfect match (PM), which is taken from a gene sequence for specific 
binding of mRNA for the gene, and the mismatch (MM), which is artifi- 
cially created by changing one nucleotide of the PM sequence to control 
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nonspecific binding of mRNA from the other genes or noncoding sequences 
of DNA. The probe pairs are immobilized into an array, where each spot of 
the array contains a probe. An RNA sample labeled with a fluorescent dye 
is hybridized to a microarray, and the array are then scanned. The expres- 
sion levels of different genes can be measured by the intensities of the spots. 
We use PM or PM-MM as the intensity data for our statistical analysis. 
Extensive studies have been carried out on how to summarize the gene ex- 
pression levels based on the probe level data. Li and Wong (2001) proposed 
a multiplicative model: 



where y is the observed intensity of each spot, 9 is the array effect, <p is the 
probe effect, e is the random error, i indicates the ith array and j refers 
to the jth probe. This model, along with some of its variations, has been 
routinely used in microarray data analysis. In the present paper we focus 
on one natural question: how well can we use one quantity 6^ to adequately 
summarize the expression level for each probe set in the ith array? Hu, 
Wright and Zou (2006) show that the least squares estimate (LSE) of the 
parameters in the model can be obtained as the first component of the 
singular value decomposition (SVD) of the intensity matrix Y, where 



Motivated by their work, we aim to develop useful methods to test if ad- 
ditional parameters are needed to characterize the expression data of each 
probe set in each array based on the SVD. 

When we applied the SVD to the 20 GeneChip microarrays produced in 
a recent MicroArray Quality Control (MAQC) project [Shi et al. (2006)] 
for contrasting colorectal adenocarcinomas and matched normal colonic tis- 
sues, we found a number of probe sets (including Probe set "214974_x_at" 
designed to measure the gene expression for Gene "CXCL5") with a signif- 
icant 2-dimensional structure. The first two singular vectors for Probe set 
"214974_x_at" are displayed graphically in Figure 1, indicating that the usual 
unidimensional summary of gene expression (corresponding to the first right 
singular vector) would mask the differential expression of Gene "CXCL5" 
in the tumor tissues. Recent studies, such as that reported in Dimberg et 
al. (2007), show that this gene indeed plays an important role in colorectal 
cancer. More detailed findings about this probe set can be found in Section 5 
together with additional examples. 

In Section 2 we propose a 2-dimensional model to take into account both 
the mean structure and the variance structure of the data matrix. We use 
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Uij = 9i<t>j + 



i = l,...,n,j = l,...,m, 
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Fig. 1. Scatterplot of singular vectors for the probe set "214974-X-at." The probe numbers 
are shown in the lower plot, and the dotted line is given by the least trimmed squares 
estimate. The circles in the upper plot represent the arrays hybridized by the samples from 
the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by 
the samples from the normal colonic tissues. Sections 1 and 5 refer to this figure. 



a multiplicative model extended from Model (1), but the array effects are 
assumed to be random, in consistency with the fact that the arrays are 
typically drawn from a larger population. The LSE of the parameters in 
the model can be efficiently estimated via SVD. We are interested in the 
dimensionality of the mean of this data matrix, but first we need to define 
it in a precise way. 

Definition 1.1. Given an n x m random matrix Y, we define the mean 
matrix as E(Y). If the rank of E(Y) is k, then the dimensionality of Y is 
defined as k, where k £ {1, 2, . . . , min(n, m)}. 

If the rank of E(Y) is k, it is well known that the SVD of E(Y) has 
k nonzero singular values, and E(Y) can be decomposed as Yli=i ^iMjJif j 
where Ai > A2 > • • • > A^ are the singular values, 6 W 1 is the ith left vector 
and S M. m is the ith right vector, for i = 1, 2, . . . , k. Moreover, 
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Our primary question is whether the dimensionality (rank) of the matrix 
E(Y) is one or two. For this purpose, we formulate our hypothesis as 
Ho :E(Y) = \iUiv[ versus H\ : E{Y) = XiUiV^ + X2U.2II2 ■ It is possible to 
consider higher ranks of the mean matrix, but our approach is best illus- 
trated with the rank 2 alternative, which is also the most relevant scenario 
in many applications. In Section 3 three test statistics are proposed for this 
problem and their asymptotic results are given. The asymptotic analysis 
based on the SVD of Y differs from the classical literature on the eigenval- 
ues and eigenvectors of a sample covariance matrix, because the latter works 
on a data matrix with its mean removed, but our focus is directly on the 
mean of the data matrix. 

When the number of microarrays in an experiment is small due to the cost 
concerns, the asymptotic distributions of the statistics proposed in Section 3 
may not be sufficiently close to their exact distributions. Hence, we apply the 
bootstrap techniques to calibrate the first two tests discussed in Section 3. 
In Section 4 we assess the finite sample performance of the tests proposed 
in Section 3 by Monte Carlo simulations. Finally, in Section 5 we apply the 
proposed tests to real data sets from two studies. Our analysis shows that 
the second dimension of the probe-level data is often indicative of interesting 
features of a probe set. A number of scenarios for the inadequacy of a uni- 
dimensional summary are discussed through the case studies and in the 
concluding Section 6. For example, we point out how our approach relates to 
and differs from probe remapping, and show that a high percentage of probes 
of poor binding strengths in a probe set can mask gene expression profiles 
through a unidimensional model. All the proofs of lemmas and theorems 
given in the paper can be found in the supplemental article Feng and He 
(2009). 

2. Model and estimation. In this section we propose a multiplicative 
model extended from Model (1) to account for a possible second dimension 
in the data matrices. Furthermore, the asymptotic properties of the LSE of 
the parameters in the model are discussed. 

2.1. A Multiplicative model with random effects. Our proposed model 
takes the form 

(2) y i = <^ 0) + e2 ) ^+Si. * = i,2,...,n, 

where y. = (yu,yi2, ■ ■■,yim) T is the ith observed vector, 9^ = (9$,.. . ,of}) T 

and 9^ = (#21 > • ■ • ; $2n ) T are use d to explain the row effects, and <fi^ = 

(<j>yi , • • • , 4m) T an d 02°^ = (^21) • • • ' 4m) T are used to explain the column 
effects in the data matrix. When applied to the probe level microarray data, 
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9 stands for the array effect and (ft represents the probe effect. Using || • || 
to denote the L 2 norm for vectors, and a _L b for orthogonality of a and b, 
we make the following assumptions: 

(Ml) 0[ o) and are two m-dimensional unit vectors with <ft^ _L <f>^ . 

(M2) 9j are independently distributed with mean // . = (fiji, ■ ■ ■ , ^j n ) T and 
variance o~j!n, for j = 1) 2, and all the components in each vector are 
independent. The third and fourth central moments of 9^ are 7? and 
Tj, respectively, for j = 1,2. Moreover, // _L // . 

(M3) The error variables = (en, . . . , £i m ) T are identically and indepen- 
dently distributed with mean zero and variance-covariance matrix 
(7 2 I m , and the third and fourth central moments of Sij are 7 3 and 
r 4 , respectively. 

(M4) {9^}, {#2i } and {§4} are mutually independent. 

(M5) n~ 1 ||^ 1 1| 2 — > // 2 and n~ 1 ||/.i 2 || 2 — > fi 2 as n — > 00 for some finite constants 
//1 and fi2- We assume that // 2 + o\ > // 2 + o\ , which is necessary for 
the identifiability of the model parameters. 

(M6) ||// //.|| = O(n), j = 1,2, where indicates the pointwise product 
of two vectors. 



2.2. Least squares estimate of column effect parameters. In this sec- 
tion we discuss the properties of the LSE of the column effect parame- 
ters. Let 9_ x = (9 U , . . . ,9 ln ) T , 9_ 2 = (0 21 , . . . , 9 2n ) T , <p = (g,g) T and = 
(tfi[ ,9 2 , V? T ) T - With the objective function 



(3) dr»(0)=Ell&-01 



^2/ 



2 



2-2 1 



i=l 



the least squares estimate of 1? can be found by minimizing d n (i?). In the 
present framework, the total number of parameters increases with the num- 
ber of observations. To facilitate the analysis, it helps to view 9^ and 2 
as nuisance parameters. If (3) is minimized at 1?, then 9u and 9 2 % minimize 

hi 1| 2 

with respect to #ij and #2i given (ft and </> . Furthermore, 

(4) *i = (&)~ 1y &. 
and 

(5) ^ = (^2)-^^. 
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Therefore, <p minimizes the following objective function: 

(6) d*M = J2hi~ ^fxt^&yjtx - [(^ 2 )" 1 fe]^ 2 ll 2 - 

i=X 

2.2.1. Consistency and asymptotic representation. We consider the asymp- 
totic properties of <p assuming that the number of probes m is fixed but the 
number of arrays n — > oo. As shown in the preceding subsection, <p is a con- 
strained M estimator that minimizes (6) subject to ||$> || = ||</> || = 1 and 
<f> 1 _L <f> . The derivations in the Appendix lead to the following results. 

Theorem 2.1. When Model (2) and assumptions (M1)-(M6) hold, ip —> 
<p(°\ where (p is the least squares estimate of cp( ' , that is, minimizes 
Y^LxPiltiW) subject to || = ||^ 2 || = 1 and ^ _L cf) 2 , where 

( 7 ) p(&w) = hi - dim)*! - fe^ll 2 - 

Theorem 2.1 makes it possible for us to give the Bahadur representation 
for (J) and <f> from the results of He and Shao (1996). We now consider the 
limiting distribution of y/n(ip — </?^), which is critical for us to discuss the 
asymptotic properties of the test statistics proposed in Section 3. Let 

(8) T n = (n-^f + af)^^ T + (n- 1 !!^!) 2 + + ^ 
where J m isanmxm identity matrix. Then we have the following theorem. 



Theorem 2.2. When Model (2) and Assumptions (M1)-(M6) hold, we 
have, for j = 1,2, 

(9) 

+ o(n" 1+e ), 
where e is any positive number, and 

(10) D jn = -2r n + 2^ T T n ^I m + 4<t>f<t>f T T n . 

Thus, both y/n(4> — <f>^) an d V^(^ 2 ~ ^2 ) are as y m Ptotically normally 
distributed with mean and variance-covariance matrix, say, Cx and C 2 , 
respectively, where Cx and C 2 are determined by tp^ and the first four 
moments of y.. 
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2.3. Least squares prediction of row effects. We now discuss the asymp- 
totic properties of the least squares prediction of the row effects based on 
(4) and (5). The result is summarized in the following theorem. 

Theorem 2.3. When Model (2) and assumptions (M1)-(M6) hold, we 

have hi = £y. A 9^ +ej^ and 9 2l = £y. A 9$ +ej^\ where A 
denotes convergence in distribution. 

Let 

(11) r = fa* + a 2 )^^ T + ( M | + a 2 )^^ T + a 2 I m . 

The first two eigenvalues of this matrix are \i\ + o~\ + o~ 2 and \i\ + o~\ + o 2 , 
with the remaining eigenvalues a 2 . Let 

(12) S n = n~ 1 Y T Y - Ti^llY^H 2 - n~ l \\Y^ 2 \\ 2 . 
Then, from (4), (5) and Theorem 2.2, we have 

„-l||gj2 a^ M 2 + (7 2 + CT 3 (j = l,2), 

and 

(m-2)- 1 S n ^a 2 , 

based on the strong law of large numbers. These consistent estimators for all 
the eigenvalues of the matrix T will be used when we construct the tests in 
the following section. On the other hand, we note that *g> and 0® may have 
their individual means Liu and Li 2 i, respectively, and, thus, it is impossible to 
consistently estimate the individual parameters fiu, Li 2 i, a\ and a 2 without 
any further information. 

3. Hypothesis testing. In this section we consider testing the null hy- 
pothesis that Hq:[i 2 = 0. The second dimension 9^(j)^ T in Model (2) does 
not provide meaningful information on the mean structure of the data ma- 
trix under this null hypothesis. We expect 9 2 to have zero mean under the 
null hypothesis and nonzero mean under the alternative hypothesis, because 
9 2 i — > 0$ +^F^2°^ as n ~ y °°' M°ti va t;e-d by this, we construct test statistics 
based on {9 2 i,i = 1, 2, . . . , n}. We consider three specific test statistics in the 
following sub-sections. 

3.1. Test on a target direction. Consider 

(13) Ta = n' l a T 9 2 , 

for any a = (a\, . . . , a n ) T € W 1 such that a T Li 1 = 0, ||a|| 2 = n and maxi<j< n a 2 / 
n — > 0. We choose a vector a such that ai/j because li is orthogonal to 
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fi and we want to test the null hypothesis that fx = 0. We use l n to indi- 
cate the n-dimensional vector with all the components equal to 1. From the 
asymptotic properties discussed in Section 2, we have the following theorem. 

Theorem 3.1. If the observations y 1 ,y 2 i ■ ■ ■ iV are drawn from Model 
(2) and assumptions (M1)-(M6) hold, and a 6 W 1 ' is a vector satisfying 
a T ^ 1 = 0, a T a = n and maxi<j< n a^/n — > 0, then 

n~ l l 2 Je 2 /a 4 N(0, 1) 
under the null hypothesis that fi = 0, where 

(14) a 2 = n~ 1 \\9 a \\ 2 -§^. and h.=n~ x f 2 \ n . 

The power of the test depends on how far a T fi 2 deviates from zero. As to 
the target direction a, it is usually determined by some specific comparison 
in practice. We will give examples of choosing a in Section 5. 

3.1.1. A practical solution when fi is unknown. In practice, the true 
value of the mean vector fi is unknown, but it can be estimated when extra 
group information is available. Assume that the observations can be divided 
into p groups such that fin are equal within each group. We assume that 
Mi.nt-i+i = ••• = Mint, for t = l,2,...,p, where n = < n\ < ■ ■ ■ < < 
n p = n, and assume that p is fixed but nt — nt-i —> oo when n — > oo. For 
microarray data, those arrays that use the same types of tissues may form 
one group, and specific examples will be discussed in Section 5. 

Suppose that fi\ nt is a consistent estimator of fi\ nt Let 

f£l = (Alni j • ■ • , Alni j Aln2 ; ■ • ■ j Mlri2 j ■ • ■ j Aln p ; ■ • ■ ; Aln p ) , 

where the number of fi\ nt m the above vector is nt — nt-i, t = 1, 2, . . . ,p. 
Furthermore, when we choose a vector a orthogonal to fl l , we only consider 
the candidates whose entries can be divided into groups and are equal to 
each other within each group in the form of 

fl OC (cLni 5 • • • 5 Q"n\ i ^ri2 > • • • > Q"ri2 > ■ • ■ j Q>n p i • • • i ^n p ) • 

With a convergent to a, the statistic = n~ l ^9_ 2 has the same Bahadur 
representation as if we chose a vector a orthogonal to fi under the null 
hypothesis. Hence, when we construct the tests in Section 3, we can use a 
that is orthogonal to p,^ The choice of a is not unique, and is best chosen 
in response to specific alternatives of interest in a given experiment. 
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3.2. A x 2 test with multiple directions. As shown in Section 3.1, the 
power of the test Ta depends on the direction a that we choose. In some 
cases, we may consider several directions simultaneously. Let us consider a 
k x n matrix A, where A; is a fixed integer and k < n. The ith row of the 
matrix A is denoted as 0$ and the jth component of o 4 is denoted as a%j for 
i = 1, . . . , k and j = 1, . . . ,n. Assume that a t _L aj for i / j, _L fi , af^ = n 
and maxi<j< n afj/n — > for each i. Then, we propose the test statistic 

T A = n- 1 \\A§ 2 \\ 2 /a 2 , 

with the following result. 

Theorem 3.2. Under the assumptions of Theorem 3.1, and for the ma- 
trix A described in this subsection, we have Ta — > x\ ^ n distribution under 
the null hypothesis that \i = 0, where x\ has the chi-square distribution with 
k degrees of freedom. 

In practice, given observations, we should not choose k that is close to n, 
because 

\\A6 2 \\ 2 = n 2 a 2 

when k = n — 1, and the variations accumulated from approximation errors 
will ruin the chi-square approximation. 

3.3. Bootstrap calibration. Sometimes, the sample size n is too small for 
the asymptotic approximations to perform well. Hence, we propose a finite 
sample adjustment to control the type I errors. 

A bootstrap method, which avoids resampling from the rows or columns 
of the data matrix, to test the null hypothesis that A* 2 = can be described 
as follows: 

(i) Draw n copies . . . ,j n } with replacement from {1,2, ... , n} and let 
®2i = = 1, 2, . . . , 7T.) , where Q% = n' 1 Y17=i ^2i 5 and then eval- 
uate T* as 

T*=n-^ 2 a%/(n-^ 2 \\ 2 -9* 2 2 ^ 2 , 

where £ = (§* 21 , . . . , §* 2n ) T and §*. = n" 1 £™ = i ^ 

(ii) Repeat Step (i) for B times to get the test statistic T* b , b = 1, 2, . . . , B. 
We estimate the bootstrap p- value by 

B 

p = B- 1 Y J I{\Tt b \>\Ta\}. 

6=1 
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To see this bootstrap method work, we note that 



i=l 
n 



i=l 




L #(0,1) 



where y* =y . , and 

n^llLf " % ~ (n-'WllW 2 ~ 0*2 2 ) = Op(l) 

Since 
\ i=i 

under the null hypothesis, the bootstrap method works by Theorem 1 of 
Mammen (1991). Our proposed bootstrap method acts on 62, and avoids 
repeated computations of the SVD. The same idea can be used for Ta- 

3.4. Test based on maximum over directions. If we do not have guided 
directions to look for patterns in // , we may wish to search over a larger 
number of directions. The chi-square test in Section 3.2 does not apply when 
k is large. However, the maximum over k = n — 1 directions, 

(15) M n = max n~ ll2 a^L, 

i<i<n-i ~ J 

has a simple limiting distribution when and 9^ are normally distributed. 
Let 



(16) Cn = y/2ln(n-l) and b n = c n - 2 _1 c~ 1 ln(4vrln(n - 1)). 

Theorem 3.3. Assume the conditions of Theorem 3.1, with the addi- 
tional assumption that and £y are normally distributed. For any matrix 
A as described in Section 3.2 with k = n — 1, we have P(c n (M n /a — b n ) < 
x) — > e~ e x as n — y 00 under the null hypothesis that \i = 0. 

Under the alternative hypothesis, we should observe larger values of M n . 
Furthermore, the convergence rate of the extreme statistic is discussed in 
Section 4.6 of Leadbetter, Lindgren and Rootzen (1983). Based on their 
arguments, we can use [$(n)] n_1 to approximate the probability P(M n /a < 
u) in computing the p-values of the proposed test here. 

The normality of #3 and e 4 is not a necessary condition for the limit- 
ing distribution to hold. Our simulation results not reported in this paper 
suggest that Theorem 3.3 may hold in a much broader setting. 
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4. Simulations. To assess the performance of the proposed tests in the 
present paper, we report Monte Carlo simulation results by simulating data 
from Model (2), with the following specifications. The size of the parameters 
are chosen to mimic some real microarray data: 

(i) 6? is generated from the multivariate iV(/x , 150,000J n ), where /x = 
(4500,4500,... ,4500) T ; 

(ii) 9^ i s generated from N(p, , 10,000/ n ), where [i is equal to either 
(0,0,..., 0) T as the null hypothesis or (125, -125, . . . , 125, -125) T as 
an alternative hypothesis; 

(iii) ^ = {2^y 1 (l,l,...,l) T and ^ = (2V3)" 1 (1, -1, . . . , 1, -1) T are of 
dimension 12; 

(iv) The errors £ij(i = 1,2, ... ,n,j = 1,2, ... , 12) are drawn from three dif- 
ferent distributions in different experiments: the normal distribution 
iV(0,5000), the t-distribution with 5 degrees of freedom multiplied by 
10\/30 and the centered ^-distribution 50(Z 2 — 1), where Z ~ N(0, 1). 

4.1. Test on a target direction. Four different sample sizes are used: n = 
8,16,32 and 128. Furthermore, we chose two different a to compare the 
performance of the tests Ta discussed in Section 3. 

4.1.1. Case 1. In the first case, we choose a = (1, — 1, 1, —1) T , which 
is the ideal choice for detecting the alternative in our settings. We draw 
5000 data sets, and the 5000 p- values are calculated based on the limiting 
distributions in Theorems 3.1. For the test T a , the type I errors are close to 
the nominal level of 0.05 when n > 16. Also clear from Table 1 is that the 
power of the test is decent even when the sample size is as small as 8. 

4.1.2. Case 2. We choose 

a = 2- 1 v / 3(l,-l,...,l,-l) T + 2- 1 (l,...,l,-l,...,-l) T 

to see whether the test has the meaningful power when a is not so well 
chosen to target the true pattern in /x . The results are given in the lower 
half of Table 1. A comparison with Case 1 shows that the power of the test 
T a is sensitive to the choice of a for small n, so a good target direction based 
on the nature of the experiment or the knowledge of the experimenter is very 
valuable. 

4.2. The x 2 test. For the x 2 test °f Section 3.2, four sample sizes n = 
8, 16, 32, 64 are used with the Monte Carlo sample size of 5000. We generated 
k = 4 vectors, which are orthogonal to fi , orthogonal to each other, and 
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Table 1 

Type I errors and powers of the target direction test are listed with increasing sample size 
n. The errors are generated from three different distributions 







Null 






Alternative 




Size 


Normal 


t 


x 2 


Normal 


t 


x 2 


go 


0.0560 


0.0510 


0.0430 


0.6362 


0.6088 


0.5718 


16" 


0.0542 


0.0492 


0.0426 


0.9540 


0.9308 


0.9004 


32 a 


0.0470 


0.0500 


0.0460 


0.9998 


0.9974 


0.9940 


128 a 


0.0522 


0.0508 


0.0532 


1.0000 


1.0000 


1.0000 


8" 


0.0552 


0.0568 


0.0458 


0.4202 


0.4104 


0.3854 


16 6 


0.0530 


0.0500 


0.0490 


0.8358 


0.8190 


0.7840 


32 b 


0.0546 


0.0494 


0.0440 


0.9934 


0.9890 


0.9854 


128 6 


0.0522 


0.0514 


0.0486 


1.0000 


0.9998 


1.0000 


a The results are from Case 1. 










6 The results are from Case 2. 
















Table 2 








Type 


/ errors and 


powers of the \ 2 


test are listed with increasing sample size n. 


The 






errors are drawn from three distributions 










Null 






Alternative 




Size 


Normal 


t 


x 2 


Normal 


t 


2 

X 


8 


0.0000 


0.0000 


0.0000 


0.0000 


0.0000 


0.0000 


16 


0.0296 


0.0264 


0.0236 


0.6406 


0.6104 


0.5734 


32 


0.0418 


0.0384 


0.0394 


0.9918 


0.9846 


0.9822 


64 


0.0464 


0.0504 


0.0422 


1.0000 


0.9990 


1.0000 



are of length n. The algorithm to generate the vectors can be described as 
follows: 




where <S> is the Kronecker product, and the product is repeated n times. After 
the first column of A is deleted, the next k = 4 columns are the vectors we 
use in the x 2 test. The estimated type I errors and powers of the test are 
listed in Table 2. It is clear that the type I error is not close to 0.05 when 
n < 16. In fact, we find that the type I errors in Table 3 from the limiting 
distributions of T a and the % 2 tests can be too high or too low when the 
sample sizes n are small. The bootstrap method manages to control the type 
I errors even at small samples. 

4.3. Test based on maximum over directions. Similar to Table 2, Table 4 
shows the performance of the test M n of Section 3.4 based on the limit- 
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ing distributions. The test is conservative for small n, but remains quite 
powerful in the study. The test can be used even when the normality as- 
sumption in Theorem 3.3 is violated. However, our simulation results that 
are not reported here suggest that if 9^ and do not have finite 4th mo- 
ments, the limiting distribution would not take effect for realistic sample 
sizes considered in this paper. 

5. Case studies. In this section we analyze two microarray data sets. We 
apply our testing methods to search for genes with potentially complicated 
mean structure, and further analyze some of those genes to understand the 
possible causes. The data are quantile normalized in each case. 



Table 3 

Type I errors and powers are listed for comparison between the bootstrap and the 
large-sample approximation. The errors are generated from three different distributions 





Asymptotic approximation 




Bootstrap 




n 


Normal 


t 


x 2 


Norma] 


t 


x 2 


Type I error 














6 a 


0.1174 


0.1096 


0.1004 


0.0420 


0.0416 


0.0350 


ga 


0.0552 


0.0568 


0.0458 


0.0484 


0.0520 


0.0440 


8 6 


0.0000 


0.0000 


0.0000 


0.0406 


0.0396 


0.0256 


16 b 


0.0296 


0.0264 


0.0236 


0.0520 


0.0430 


0.0420 


Estimated power 














6 a 


0.4950 


0.4820 


0.4646 


0.2560 


0.2380 


0.2194 


8" 


0.4202 


0.4104 


0.3854 


0.3738 


0.3670 


0.3480 


8 b 


0.0000 


0.0000 


0.0000 


0.1508 


0.1506 


0.1338 


16 b 


0.6404 


0.6104 


0.5734 


0.7142 


0.6912 


0.6746 


"The results are 


from the test 


on the target direction 2 1 


v/3(l, 


-1) T + 


2- 1 (l,...,l,-l,... 


,-l) T 












'The results are from the \ 2 test based on the four target directions. 








Table 4 








Type I errors and 


powers of the 


test based on maximum over directions are listed with 


increasing sample size n. 


The errors 


are drawn from three distributions 






Null 








Alternative 




Size Normal 


t 


x 2 




Normal 


t 


x 2 


8 0.0018 


0.0012 


0.0008 




0.0270 


0.0268 


0.0232 


16 0.0306 


0.0256 


0.0190 




0.6992 


0.6620 


0.6216 


32 0.0378 


0.0376 


0.0264 




0.9850 


0.9766 


0.9666 


64 0.0428 


0.0404 


0.0362 




1.0000 


0.9988 


0.9994 
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5.1. Example 1. We considered the GeneChip data (http://www.ncbi. 
nlm.nih.gov/geo/query/acc.cgi?acc=GSE5350) obtained from the recent Mi- 
croArray Quality Control (MAQC) project and used in Lin et al. (2006). 
We have a total of 20 microarrays (HG-U133-Plus-2.0), generated from five 
colorectal adenocarcinomas and five matched normal colonic tissues with 1 
technical replicate at each of two laboratories involved in the MAQC project. 

In this study we use PM as the intensity measure in Y, and carry out 
the SVD to get the two largest singular values Ai > A2. We focus on 350 
probe sets with the highest ratios A 2 , /A 2 (with all those ratios above 1/10). 
For each probe set, the probe-level microarray data are stored in a matrix, 
where the rows correspond to the probes and the columns correspond to the 
arrays. The intensities from the normal tissues are entered in the column 

1- 5, 11-15, and those from the tumors entered in the rest of columns. 

We choose a target direction to contrast the two groups in the study. In 
particular, we use 

3.1 oc (-#2, • • -,-p>2, Ai, • • -jfru-fa, -fe,fai, ■ ■ ■ ,£i) T , 

where Ai is taken to be the median of On of the first group (normal tissues), 
and fi2 the median of On of the other group. Hence, we have _L fi, where 

A = (Aii • ■ • > Aii £2, • • • , A2, Ai, . . . , Ai, Az, • • • , £2) T - 

By the statistical test T a developed in Section 3.1, we find that 81 out of 
350 probe sets are detected as individually significant at the 0.05 level. Out 
of those, 36 probe sets remain significant after the multiple test adjustment 
of Benjamini and Hochberg (1995). 

We plot (On, #2i)> 2 = 1,2,..., 20, and (cf>ij , cp2j ) > J ' = 1, 2, . . . , m, for those 
probe-sets that are detected as significant; some interesting facts can be 
observed. We now zoom in on three of those probe sets. 

5.1.1. Probe set "214974-X-at. " In the study the probe set "214974_x_at" 
is used to measure the expression level of Gene "CXCL5." Our test gave the 
p-value of 1.11 x 10 -3 , the adjusted p-value of 2.38 x 10 -2 and the (/-value, 
as proposed in Storey (2003), of 5.77 x 10 -4 , offering significant evidence 
against the unidimensional model. The first four singular values of the data 
matrix are (3387, 1388, 361, 168). As mentioned in the Introduction with 
Figure 1, the arrays cannot be easily separated by the first right singular 
vector, but if we use (On, 621) jointly, the arrays are well separated in the 

2- dimensional space. The usual one-dimensional index of the probe set is 
insufficient to summarize the gene expression of "CXCL5." 

Further inspection of the data shows that the intensities from Probe 3 
are much higher than those of the other probes, and Probe 3 dominantly 
contributes to the values of On- By the Basic Local Alignment Search Tool 



> 
be 



> 
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Fig. 2. Scatterplot of singular vectors for the probe set "214974-Z-at" after we remove 
Probe 3. See Figure 1 for more details about this figure. 



(BLAST, http : //www.ncbi .nlm.nih.gov/blast/Blast . cgi), we found that 
Probe 3 is represented in both Gene "CXCL5" and Gene "N-PAC," but 
the other probes were confirmed as specific to Gene "CXCL5." We further 
confirmed that the intensities of Probe 3 were highly correlated with the 
intensities of several probes in the probe set "208506_x_at" (designed by 
Affymetrix to measure the expression level of Gene "N-PAC"), and thus, we 
need to take Probe 3 with caution. If Probe 3 were removed from the probe 
set, we would have seen a clear separation of the two groups from the first 
singular vector; see Figure 2. In this case, the second singular vector from 
the whole probe set appears to be a better summary of Gene "CXCL5." We 
note that Gene "CXCL5" has been indicated as an important gene for col- 
orectal cancer in the literature. For example, Dimberg et al. (2007) observed 
significantly higher expression levels of the protein encoded by "CXCL5" in 
colorectal cancer tumors than in normal tissue, so the multidimensionality 
of the probe set "214974_x_at" flagged through our statistical work can offer 
biologically relevant information. 

5.1.2. Probe set "227899.at." The probe set "227899_at" is designed by 
Affymetrix to measure the expression level of Gene "VIT." Our test gave 
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Fig. 3. Scatterplot of singular vectors for the probe set "227899-at." The probe numbers 
are shown in the lower plot, and the dotted line is given by the least trimmed squares 
estimate. The circles in the upper plot represent the arrays hybridized by the samples from 
the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by 
the samples from the normal colonic tissues. 



the p-value 8.78 x 10 4 , the adjusted p-value 2.38 x 10 2 and the (/-value 
5.77 x 10~ 4 . The first four singular values are (3178,1011,227,77). 

From Figure 3, we note that differential expression can be detected from 
the second right singular vector, but not the first. From the probe-level 
data, we find that the intensities of Probe 4 and Probe 7 are much higher 
than those of the other probes, and these two probes dominate the first 
two singular vectors. Furthermore, we confirmed by BLAST both probes 
as specific for measuring the expression level of Gene "VIT," and so did 
the other probes. As a double check, we applied the remapping method 
proposed by Lu et al. (2007) and confirmed all the probes in this probe set 
were specified for the three transcript variants for Gene "VIT." Therefore, 
a 2-dimensional summary of the gene appears necessary for this probe set. 

To make the point further, we provide the absolute value of percentages 
calculated from M\ and M2 in Table 5, where 

(0ii</>ii ••• ^ll^lm 
; ; ; 
01n011 • • • 01n01m 
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and 

(#21021 ••• 2 \4>2m\ 
#2n021 ••• 02n4>2m/ 

It is clear that the information contained in the second dimension for 
Probes 4 and 7 is important, because in more than half of the arrays their 
contributions from the second dimension are more than 20% of those from 
the first. The joint use of On and Q^i gives a more complete picture about 
the expression profile of Gene "VTT." 



5.1.3. Probe set "1560296-at." The probe set "1560296_at" is used in 
the HG-U133-Plus-2.0 platform to represent Gene "DST." This probe set 
is detected by our test with a significant 2-dimensional mean structure (p- 
value 1.88 x 10~ 3 , adjusted p-value 2.87 x 10~ 2 and g-value 6.96 x 10~ 4 ). 
The first four singular values are (5470,1748,504,271). 

From Figure 4, we observe that the probes 1 and 2 are dominant probes. 
Further inspection shows that the first singular vector is primarily deter- 
mined by these two probes. Following the method of Lu et al. (2007), 
we find that Probes 1, 2 and 3 are remapped to three transcripts each 
("veejee.aApr07-unspliced," "DST.vlApr07-unspliced" and "DST.iApr07"), 
yet the other probes are remapped to two variants only ("veejee.aApr07- 
unspliced" and "DST.vlApr07-unspliced"). For this probe set, the signifi- 
cant 2-dimensional mean structure of the data matrix could be resolved by 
proper remapping of the probes. 







Table 5 






A 


summary of the 


absolute values of 




in percentage b 


y probes 


227899_at 


Min. (%) 


Qi (%) 


Med. (%) 


Q3 (%) 


Max. (%) 


Probe 1 


0.06 


2.48 


5.03 


6.56 


10.92 


Probe 2 


0.01 


0.32 


0.64 


0.84 


1.39 


Probe 3 


0.04 


2.00 


4.04 


5.27 


8.78 


Probe 4 


0.39 


17.37 


35.20 


45.88 


76.40 


Probe 5 


0.07 


2.97 


6.02 


7.85 


13.06 


Probe 6 


0.04 


1.84 


3.72 


4.85 


8.08 


Probe 7 


0.22 


10.01 


20.29 


26.44 


44.02 


Probe 8 


0.04 


1.77 


3.59 


4.68 


7.79 


Probe 9 


0.03 


1.29 


2.62 


3.42 


5.69 


Probe 10 


0.11 


4.74 


9.61 


12.52 


20.85 


Probe 11 


0.17 


7.69 


15.59 


20.32 


33.83 
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Fig. 4. Scatterplot of singular vectors for the probe set "1560296-at. " The probe numbers 
are shown in the lower plot, and the dotted line is given by the least trimmed squares 
estimate. The circles in the upper plot represent the arrays hybridized by the samples from 
the colorectal adenocarcinomas, while the solid points represent the arrays hybridized by 
the samples from the normal colonic tissues. 



5.2. Example 2. In this example the data (http:/ /www. ncbi.nlm.nih.gov/ 
projects/geo/query/acc.cgi?acc=GSE8874) were collected in a recent exper- 
iment with the 2x2x2 factorial design, the detail of which is discussed in 
Leung et al. (2008). The three factors (with two levels each) are as follows: 

(i) mutation: mutant or wild type (WT); 

(ii) tissue: retinas or whole body; 

(hi) time: 36 or 52 hours post-fertilization. 

Under each condition, three Affymetrix zebrafish genome arrays are repli- 
cated, so we have 24 arrays in total. The vector ft is computed as in Example 
1 by assuming that the means in each tissue group are equal. Furthermore, 
we generate two directions and a 2 , used to reflect the possible tissue and 
mutation effects, respectively. In the study we still use PM as the inten- 
sity measure and carry out the singular value decomposition to get the two 
largest singular values as Ai and A2, where Ai > A2. We focus on 75 probe 
sets with the highest A|/Af (with all those ratios above 1/10), and use the 
X 2 test described in Section 3.2 on each of those probe sets. 

In this example 39 out of 75 probe sets are detected as individually sig- 
nificant, out of which 39 probe sets remain significant after the multiple test 
adjustment of Benjamini and Hochberg (1995). We shall describe one such 
probe set in detail. 
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5.2.1. Probe set "Dr.7506.1.Al_at." In the zebrafish genome array, the 
probe set "Dr.7506.1.Al_at" corresponds to gene "tuba812." The x 2 test 
gave the p- value of 2.37 x 10~ 5 , the adjusted p- value of 7.52 x 10~ 5 and 
the g-value of 4.83 x 1(T 6 . The first four singular values are (43142, 14839, 
2078, 1688). It is clear from Figure 5 that we cannot distinguish two tissue 
groups based on 8u, but the two groups are well separated by #2i- Further 
inspection of the data shows that the intensities of Probe 3 are linearly 
related with &u, but Qi% are linearly related with the intensities of Probe 15. 
From Table 6, we see that the information from Q<ii are clearly nonnegligible. 
Furthermore, we used BLAST to verify that all the probes are appropriate 
for Gene "tuba812," so there is strong evidence that the expression profile for 
Gene "tuba812" cannot be summarized by the usual unidimensional index 
across experimental conditions. In fact, the commonly used gene expression 
index would mask the clear differential expressions of the two tissue types. 

6. Conclusions. In this article we have proposed a new framework for 
testing the unidimensional mean structure of the probe-level data matrix. 
For most applications, we can carry out the tests discussed in the article 
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Fig. 5. Scatterplot of singular vectors for the probe set "Dr.7506.1.Al_at." The probe 
numbers are shown in the lower plot and the dotted line is a robust linear fit. The circles 
in the upper plot represent the arrays hybridized by the samples from retinas, while the 
solid points represent the arrays hybridized by the samples from whole body. 
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Table 6 

A summary of the absolute values of #2i02j/#ii0ij in percentage by probes 



Dr.7506.1.Al_at 


Min (%) 


01 (%) 


Med (%) 


03 (%) 


Max. 




Probe 1 


23.71 


40.44 


48.73 


58.58 


81 


25 


Probe 2 


20.13 


34.33 


41.37 


49.73 


68 


.97 


Probe 3 


7.76 


13.23 


15.94 


19.16 


26. 


.58 


Probe 4 


30.74 


52.42 


63.16 


75.94 


105 


.32 


Probe 5 


13.20 


22.51 


27.12 


32.60 


45, 


.22 


Probe 6 


7.95 


13.56 


16.33 


19.64 


'27 


.23 


Probe 7 


12.37 


21.10 


25.42 


30.56 


42 


.38 


Probe 8 


3.08 


5.25 


6.32 


7.60 


10 


.54 


Probe 9 


18.24 


31.10 


37.48 


45.05 


62 


48 


Probe 10 


27.56 


47.00 


56.63 


68.08 


94 


42 


Probe 11 


19.89 


33.92 


40.87 


49.13 


68 


14 


Probe 12 


26.07 


44.47 


53.58 


64.42 


89 


.34 


Probe 13 


11.71 


19.98 


24.07 


28.94 


40 


.13 


Probe 14 


33.25 


56.70 


68.32 


82.14 


113 


.92 


Probe 15 


38.84 


66.24 


79.81 


95.96 


133 


.08 


Probe 16 


39.66 


67.64 


81.50 


97.98 


135 


.88 



based on large sample approximations. We also proposed a model-based 
bootstrap algorithm to better control type I errors when the sample size is 
small. 

In two case studies, the proposed method detected genes whose expression 
levels were not well summarized by unidimensional indices. Through detailed 
inspection of the probe- level intensities of those genes, we found that the 
intensities of different probes can show different profiles across experimental 
conditions. In our investigation, we noticed that the following scenarios exist 
for the violation of a unidimensional gene expression summary: 

(1) A large percentage of probes that have poor binding strengths or low 
intensity measures in a probe set can mask the gene expression profiles. 

(2) One or more probes should be remapped to different variants of the 
same gene. 

(3) One or more probes are cross-hybridized. 

(4) An outlying and erroneous measurement is present for one of the 
probes. 

(5) The multiplicative model used to summarize gene expression is inad- 
equate even with all the probes well selected. 

It has been observed by Harbig, Sprinkle and Enkemann (2005) that out- 
lier signals on just one probe can seriously affect the calculations used for 
the subsequent analysis. While we do not always have definite answers as 
to the biological implications of such structures, our statistical analysis is 
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valuable in both nagging the potentially interesting and important probes 
and genes for further scientific investigations. Our approach does not lead 
directly to probe remapping, but may suggest candidates for possible alter- 
native mapping [Gautier et al. (2004); Lu et al. (2007)]. The bottom line 
is clear: if we solely rely on models that assume unidimensional gene ex- 
pressions, we might miss some of the complexities in gene expression data 
analysis. When a unidimensional model is shown to be inadequate, appro- 
priate actions, such as probe remapping, an alternative model or a different 
summarization method [e.g., Kapur et al. (2007)], are called for. 

Acknowledgments The authors thank Doctors Ping Ma and Sheng Zhong, 
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SUPPLEMENTARY MATERIAL 

Proofs of Main Results (DOI: 10. 1214/09- AOAS262SUPP; .pdf). We give 
a lemma on consistency, followed by the proofs for the theorems that are 
described in Sections 2 and 3. 
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